Principal Components Analysis (PCA) - A tool for dimensionality reduction

The purpose of PCA and some intuition

If you're familiar with how PCA works feel free to skip straight to the coding part - if not, read on!

In this era of Big Data, it's entirely possible to have too much data. We might have so much data that it makes our brains or computers crash when we try to process it. What can we do when this happens?

The simplest answer is to just throw some of the data away. If we have a dataset $X$ with $n$ examples and $p$ features, we could get rid of some of the examples to leave a more manageable dataset of size $m \times p$ (where $m < n$), or we could randomly pick some features and throw them away to yield a dataset of size $n \times q$ (where $q < p$).

This approach, whilst quick and easy, is far from optimal. Just throwing information away usually makes our models worse, and modelling complex relationships between features is hard enough already!

We still want to reduce the size of our dataset though, so consider the following approach: we sift through each feature in the dataset and look for ones we can get rid of. How do we decide whether a feature is a suitable candidate to be discarded? Let's think of an example: suppose we were trying to use information about houses (size in $m^2$, distance to the nearest shop, number of bedrooms etc.) to predict house prices, and we noticed that all of the houses in our dataset were 3 bedroom houses. The number of bedrooms would be an excellent feature to discard because there is no variability for that feature in the dataset - it doesn't allow us to discriminate between different houses and so is of no use to us.
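
This "look for features with no variability" idea is easy to automate in its crudest form. Here's a minimal pandas sketch (the tiny DataFrame is made up purely for illustration) that drops any feature whose variance is zero:

import pandas as pd

#Hypothetical housing data: every house happens to have 3 bedrooms
df = pd.DataFrame({'size_m2': [70, 120, 95, 150],
                   'dist_to_shop_km': [0.5, 2.1, 1.3, 0.8],
                   'n_bedrooms': [3, 3, 3, 3]})

#Keep only the columns whose variance is non-zero
df_reduced = df.loc[:, df.var() > 0]
print(df_reduced.columns.tolist()) #['size_m2', 'dist_to_shop_km']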

Although it's an improvement on the random deletion approach, sifting through all of the features suffers from two main drawbacks:

  1. It would be rare for a feature to convey absolutely no information at all (as is the case in the housing example above), and we'd like to hang on to as much information as we can.
  2. Manually inspecting every variable in a high-dimensional dataset (which might have hundreds of features) is something no one should be forced to do unless there's a very good reason to! We need an approach that can easily be automated and that scales nicely.

PCA - Finding the directions of maximal variation

Principal components analysis operates using a similar principle to the one we described above. It works as follows for an $n \times p$ dataset $X$:

  1. Find the direction in $p$-dimensional space along which the data exhibit the most variation (note: this direction is not limited to the $p$ coordinate axes, it can be literally any direction at all). This is called the Principal Component.
  2. Find the direction in $p$-dimensional space, orthogonal to the Principal Component, along which the data exhibit the most variation. This direction is called the Second Principal Component.
  3. Find the direction in $p$-dimensional space, orthogonal to both the Principal Component and the Second Principal Component, along which the data exhibit the most variation. This direction is called the Third Principal Component.
  4. Continue this process until you have $p$ Principal Components, each one orthogonal to all the others.

These $p$ Principal Components define a new coordinate system for the data - the key point to recognise is that the data are still the same, they're just being represented differently. If you've not thought about changing coordinate systems before, it might help to think of a real world example where we use alternative coordinate systems:

An example of changing coordinates - Different representations of colour: RGB vs CMYK

One way of precisely defining the shade of a colour is to express it using the RGB system - although we're skipping over the finer details here, we express a colour as a vector $(x_1, x_2, x_3)$ where $x_1, x_2, x_3$ represent the intensities of red, green and blue, respectively. For example:

  • (255, 0, 0) represents pure red
  • (0,255,0) represents pure green
  • (255, 255, 0) represents pure yellow

Instead of using red, green and blue as the basis for describing a colour, some processes (most notably your printer) use an alternative basis: the respective intensities of cyan, magenta and yellow (printers also add a black channel - the K in CMYK - which we'll ignore here). Under this system, we would represent pure yellow as (0, 0, 255) - it's exactly the same colour that we would get by representing it as (255, 255, 0) under the RGB system, it just has a different representation under a different coordinate system.
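
To make the analogy concrete, here's a toy sketch of the simplified conversion between the two representations (ignoring the black channel and all colour-profile subtleties - rgb_to_cmy is just an illustrative helper, not a real colour-management routine):

import numpy as np

def rgb_to_cmy(rgb):
    #Change of representation: each CMY intensity is 255 minus the corresponding RGB intensity
    return 255 - np.asarray(rgb)

print(rgb_to_cmy([255, 255, 0])) #Pure yellow: (255, 255, 0) in RGB becomes (0, 0, 255) in CMY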

Back to the Principal Components

Changing coordinate systems in this way gives us a new dataset $X^*$ (still with dimensions $n \times p$). Going back to the house prices example above, our features no longer have convenient names such as size in $m^2$, but they do have the benefit that we know which ones have the most variance (and are therefore theoretically most likely to be useful in modelling the label, which is our goal). If we want to work with a smaller dataset, we can now just discard the $(k+1)^{th} \text{ to } p^{th}$ Principal Components and fit a model using the first $k$ Principal Components. In theory these $k$ features will contain more information than any $k$ features in the original dataset and should give us a better model for the same computational expense.

Data Generation

Before we describe how we actually go about finding the Principal Components, let's generate the dataset we're going to transform and visualise it.

In [1]:
#Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
n = 200 #Number of observations

x1 = np.random.uniform(0,10,n) #Feature 1: uniform on [0, 10]
x2 = x1 + np.random.normal(0,2,n) #Feature 2: feature 1 plus Gaussian noise, so the two features are strongly correlated

X = pd.DataFrame({'X1':x1, 'X2':x2})
In [3]:
X.head()
Out[3]:
X1 X2
0 8.270146 5.851099
1 3.032739 5.250034
2 0.428308 -0.212958
3 4.836655 6.021684
4 3.335869 5.326569
In [4]:
plt.scatter(x1, x2)
plt.xlim(-5,15)
plt.ylim(-5,15)
plt.show()

This dataset only has two variables, so ordinarily we wouldn't bother reducing its dimension because it's manageable enough already. However, for the purposes of illustration, a 2D dataset makes it easy to visualise the change when we turn it into a 1D dataset.

Now, it might not have been clear above exactly how we would go about finding out which direction in the 2D plane has the most variation. To measure variation along a direction, we have to project the data onto that direction. Let's say we have a point $(x_1, x_2)$ and the direction we want to project it onto is the line $l$ with equation $X_2 = m \times X_1 + c$. To project $(x_1, x_2)$ onto the line $l$ we simply draw the line through $(x_1, x_2)$ which is perpendicular to $l$ and mark the point where the two lines intersect. That intersection point is then the projection of $(x_1, x_2)$ onto $l$. When $l$ passes through the origin and has direction vector $v$, the projection of a point $x$ has the closed form $\frac{x \cdot v}{v \cdot v}v$, which is exactly what the code below computes.
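
For example, projecting the point $(3, 1)$ onto the line $y = x$, which has direction vector $v = (1, 1)$, gives $\frac{3 \times 1 + 1 \times 1}{1^2 + 1^2}(1, 1) = (2, 2)$.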

Below are a few examples of projecting our dataset $X$ onto different lines:

In [5]:
line1 = np.array([0,1]) #The line x = 0
line2 = np.array([np.sqrt(2), np.sqrt(2)]) #The line y = x
line3 = np.array([-np.sqrt(2),np.sqrt(2)]) #The line y = -x

def projectPointOntoLine(point, line):
    
    return (np.dot(point, line)/np.dot(line, line))*line

projLine1 = np.array([projectPointOntoLine(x, line1) for idx, x in X.iterrows()]) #Projection onto line 1
projLine2 = np.array([projectPointOntoLine(x, line2) for idx, x in X.iterrows()]) #Projection onto line 2
projLine3 = np.array([projectPointOntoLine(x, line3) for idx, x in X.iterrows()]) #Projection onto line 3


fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15,6))
fig.suptitle('Projecting X onto different lines')
ax1.scatter(projLine1[:,0], projLine1[:,1])
ax1.set_xlim((-5,15)); ax1.set_ylim((-5,15))
ax1.title.set_text('The line x = 0')

ax2.scatter(projLine2[:,0], projLine2[:,1])
ax2.set_xlim((-5,15)); ax2.set_ylim((-5,15))
ax2.title.set_text('The line y = x')

ax3.scatter(projLine3[:,0], projLine3[:,1])
ax3.set_xlim((-5,15)); ax3.set_ylim((-5,15))
ax3.title.set_text('The line y = -x')

Immediately we can see that the points are more spread out when we project onto the line $y = x$ than if we project onto the line $y = -x$. This gives us a good indication that there is more variability in the direction given by the line $y = x$ than in the direction given by the line $y = -x$.
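
To make this comparison a little less hand-wavy, we could compute the variance of the scalar projections onto each direction. The helper below (varianceAlongDirection is just an illustrative name, not part of the final implementation) is one way to do that, reusing the lines defined above:

def varianceAlongDirection(data, direction):
    #Project every observation onto the (not necessarily unit-length) direction vector and
    #return the sample variance of the resulting scalar projections
    unit = direction / np.linalg.norm(direction)
    scores = data.to_numpy() @ unit
    return scores.var(ddof = 1)

for name, line in [('x = 0', line1), ('y = x', line2), ('y = -x', line3)]:
    print(f'Variance along {name}: {varianceAlongDirection(X, line):.2f}')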

Checking a handful of hand-picked directions like this is, of course, a very ad hoc and non-rigorous way of determining the direction in which the data exhibit the most variation - fortunately there is a systematic way, which we turn to next.

Finding the Principal Components

We won't go into a full derivation of how to find the Principal Components - there are lots of resources on the internet where you can find one.

To obtain the Principal Components for a dataset, $X$ of dimension $n \times p$:

  1. Compute the covariance matrix of $X$. The $p \times p$ covariance matrix is computed as $S = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar x)(x_i - \bar x)^T$, where $x_i$ is a single observation of dimension $p$ and $\bar x$ is the mean of the observations. If the columns of $X$ have been mean-centred, this is equivalent to $S = \frac{1}{n-1}X^TX$.
  2. Compute all $p$ eigenvalues $(\lambda_1, ..., \lambda_p)$ for $S$ and their associated eigenvectors $(v_1,...,v_p)$, such that $Sv_i = \lambda_iv_i$
  3. Each eigenvector represents a direction, and the Principal Component is the eigenvector corresponding to the largest eigenvalue. The 2nd Principal Component is the eigenvector corresponding to the second largest eigenvalue and so on.
  4. To compute a reduced-dimension dataset $X^*$ of dimension $n \times k$, compute $X^* = XV_{1:k}$, where $V_{1:k} = [v_1, v_2,...,v_k]$ is the $p \times k$ matrix whose columns are the first $k$ Principal Components. This step projects each of the examples onto each of the first $k$ Principal Components.
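
Before building this up from scratch, here's what those four steps look like as a minimal NumPy sketch (pca_numpy is just an illustrative name; it leans on np.cov and np.linalg.eigh and, unlike the class below, mean-centres the data before projecting):

def pca_numpy(X, k):
    #X is an n x p NumPy array; returns the n x k matrix of scores on the first k Principal Components
    Xc = X - X.mean(axis = 0)              #Mean-centre each column
    S = np.cov(Xc, rowvar = False)         #Step 1: p x p covariance matrix
    eigVals, eigVecs = np.linalg.eigh(S)   #Step 2: eigendecomposition (eigh, since S is symmetric)
    order = np.argsort(eigVals)[::-1]      #Step 3: sort the eigenvalues in decreasing order
    Vk = eigVecs[:, order[:k]]             #p x k matrix whose columns are the first k Principal Components
    return Xc @ Vk                         #Step 4: project the (centred) data onto those components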

If you just wanted to get on with implementing the algorithm, this is the part where you should start paying attention again!

Implementing PCA

In [6]:
class PCA:
    
    def __init__(self, data):
        self.data = data
        self.dataArray = data.to_numpy() #Occasionally we'll want to multiply arrays together 

        
    def computeCovMatrix(self):
        #Given an n x p input matrix, X, compute the associated covariance matrix,
        #defined as 1/(n-1) * transpose(X) * X
        #Note: this equals the true covariance matrix only if the columns of X have been mean-centred;
        #we skip the centring here to keep the example simple
        
        n = self.data.shape[0]
        self.covMat = 1/(n-1)*np.matmul(np.transpose(self.dataArray), self.dataArray) #Square matrix of size p x p
        return 0
    
    def getEigenvalueDecomposition(self):
        #Input is the p x p cov matrix
        
        #We want to compute the eigenvalues and eigenvectors of the covariance matrix
        
        self.eigVals, self.eigVecs = np.linalg.eig(self.covMat) #eigVals is a vector of length p, eigVecs is a p x p matrix
        #The first element of eigVals is the eigenvalue associated with the first eigenvector in eigVecs (and so on)
        
        return 0
    
    def fitPCA(self, numComponents):
        
        #Obtain cov matrix and decompose it
        self.computeCovMatrix()
        self.getEigenvalueDecomposition()
        
        #Now we want to obtain the eigenvectors associated with the largest numComponents eigenvalues
        #There are a few ways to do this; one is to make a dictionary where the keys are eigenvalues and the
        #values are eigenvectors - in that way we establish a link between eigenvalue and eigenvector, so we
        #can sort the eigenvalues and still know which eigenvector we need
        
        self.eigDict = {}
        
        for idx, eigenValue in enumerate(self.eigVals):
            self.eigDict[eigenValue] = self.eigVecs[:,idx] #Store in a dictionary
            
        self.Vk = [] #This will be the p x k matrix made up of the first k eigenvectors
        for eigenValue in sorted(self.eigDict.keys(), reverse = True)[0:numComponents]:
            self.Vk.append(self.eigDict[eigenValue])
        
        self.Vk = np.array(self.Vk).T #Transpose so that the eigenvectors form the columns of a p x k array
        
        
        #Now that we have the first K eigenvectors stored in a new matrix, we can convert our original dataset 
        #into the dimension-reduced form
        
        self.dataReduced = np.matmul(self.dataArray, self.Vk)
        
        #Finally, let's clean up: turn this into a pandas DataFrame and return it
        
        outDict = {}
        
        for i in range(numComponents):
            outDict[f'PC{i+1}'] = self.dataReduced[:,i]
            
        outDF = pd.DataFrame(outDict)
        
        
        return outDF
    
    
    
In [7]:
myPCA = PCA(X)
In [8]:
XStar = myPCA.fitPCA(2)
In [9]:
XStar.head()
Out[9]:
PC1 PC2
0 -9.934993 -1.981563
1 -5.897294 1.407936
2 -0.139882 -0.457419
3 -7.697960 0.628734
4 -6.161300 1.240464
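
We asked for both Principal Components here so that we can compare the transformed data with the original. If we actually wanted a smaller dataset we would keep only the first one, for example:

XReduced = myPCA.fitPCA(1) #An n x 1 DataFrame containing only the scores on the first Principal Component
XReduced.head()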

Visualising the results

In [10]:
fig, (ax1, ax2) = plt.subplots(1, 2,figsize=(20,10))
ax1.scatter(XStar['PC1'], XStar['PC2'])
ax1.set_xlim((-20,5))
ax1.set_ylim([-20, 5])
ax1.set(xlabel='Principal Component', ylabel='Second Principal Component', title = 'Principal Components')

#Also want to plot the original data with the first and second pc running through

#First PC
principalComponent = myPCA.Vk[:,0]
pcDirection = principalComponent[1]/principalComponent[0] #Gradient of the line in the first PC's direction

#Second PC
principalComponent2 = myPCA.Vk[:,1]
pcDirection2 = principalComponent2[1]/principalComponent2[0] #Gradient of the line in the second PC's direction


ax2.scatter(X['X1'], X['X2'])
ax2.set_xlim((-5,15))
ax2.set_ylim([-5,15])
ax2.set(xlabel='X1', ylabel='X2', title = 'Original Data')
#Plot their lines
ax2.plot(np.linspace(-5,15), pcDirection*np.linspace(-5,15), c = 'g', linewidth = 6, label = 'Principal Component')
#The second PC line is offset by 10 so that it passes through the bulk of the data rather than the origin
ax2.plot(np.linspace(-5,15), 10 + pcDirection2*np.linspace(-5,15), c = 'r', linewidth = 6, label = '2nd Principal Component')
ax2.legend()

plt.show()

We can see that the x axis of the left plot, which corresponds to the first Principal Component, exhibits a lot more variation than the y axis, which corresponds to the second.

On the right-hand plot we've plotted the original dataset with the directions corresponding to the two Principal Components running through the data. The direction with the greatest variance looks like the one we might have intuitively expected from the projections earlier.
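
As a final sanity check, the eigenvalues we computed along the way tell us how much of the total variance each Principal Component accounts for - each eigenvalue divided by the sum of all of them. A quick sketch using the attributes our PCA class already stores:

explainedVariance = np.sort(myPCA.eigVals)[::-1] / myPCA.eigVals.sum() #Largest eigenvalue first
for i, ratio in enumerate(explainedVariance):
    print(f'PC{i+1} accounts for {100*ratio:.1f}% of the total variance')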